Introduction

In order to integrate a new feature of on/off target levels of coverage in NGSphy, I needed to understand coverage variation that one could find in a capture experiment. Goals are basically:

  1. Understand the coverage variation within samples and loci to be able to model it for both NGSphy parameterization and furthers simulations.
  2. Find out whether there is a correlation between the coverage and the phylogenetic distance to the reference species, used for the probe generation. Expecting that the closer the sample is to the reference the higher the coverage obtained ºis.

Data

Data used comes from Rute’s Hummingbird project. Most of the information reported here about the samples and targets was extracted from the paper.

Processing and storage

Workspace triploid (UVigo):

  • Original data from Dataset 1, is stored in: triploid.uvigo.es
  • Under the user folder: /home/merly/ research/cph-bam-coverage

Workspace randy (KU):

  • Original data from Dataset 1 and 2, is stored in: randy.popgen.dk
  • Under the user folder:
    • /home/merly/anna
    • /home/merly/swift

Targeted regions

Targeted regions are exons, retrieved a posteriori, due to the fact that the original probes for this project were lost during a flood.

Finally, datasets will be filter to match gene number.

birds48filename=paste0(WD,"48birds_ortholog.list.chi.anna.cpe.hum.finch")
swiftgff=paste0(WD,"Chaetura_pelagica.CDS.gff")
annagff=paste0(WD,"Calypte_anna.gene.CDS.2750.gff")
birds=read.table(birds48filename, header=T,
    colClasses=c("character","numeric","character","character",
      "character","character","character"))
swift=read.table(swiftgff,
    colClasses=c("character","character","character","numeric",
      "numeric","character","character","character","character"))
anna=read.table(annagff,
    colClasses=c("character","character","character","numeric",
      "numeric","character","character","character","character"))

birds.2=birds[birds$anna!="-",]
birds.3=birds.2[birds.2$swift!="-",]
birds.4=unique(birds.3)
birds.5=birds.4[,c("anna","swift")]
birds.5$swift=paste0("Parent=",birds.5$swift,";")
birds.5$anna=paste0("Parent=Aan_", birds.5$anna,";")
annaGenes=unique(anna$V9[birds.5$anna %in% anna$V9])
birds.6=birds.5[birds.5$anna %in% annaGenes, ]
swiftGenes=birds.6$swift

swift.2=swift[swift$V3=="CDS",]
swift.3=swift.2[swift.2$V9 %in% birds.6$swift,]
swift.3$size=swift.3$V5-swift.3$V4
swift.4=swift.3[swift.3$size>0,]
anna.2=anna[(anna$V5-anna$V4)>0,]
swift.4=swift.4[,1:(ncol(swift.4)-1)]
anna.3=anna.2[anna.2$V9 %in% unique(birds.6$anna),]
anna.3$targetname=paste0( anna.3$V1,
  rep("-", nrow(anna.3)),
  anna.3$V4,
  rep("-", nrow(anna.3)),
  anna.3$V5)
swift.4$targetname=paste0( swift.4$V1,
  rep("-", nrow(swift.4)),
  swift.4$V4,
  rep("-", nrow(swift.4)),
  swift.4$V5)

Description of the targeted regions

This is a quantitative summary description of the resulting target files. The difference between number of genes, and subsequently in number of targets and covered genome, in both Anna and Swift, is not precisely alarming. The experiment was done using known ortholog genes, which not necessarily match in number of exons (targets) or their size. There are two (2) Anna datasets, the original and the reduced (1,469 genes) and the Swift dataset.

Number of targets Size of targeted genome (bp) Number of genes Total of scaffolds
Anna - Original 24,044 3,934,867 2,750 389
Anna - Reduced 14,740 2,285,746 1,469 314
Swift 14,764 2,279,626 1,469 372

Number of targeted regions per gene

Number of targeted regions per scaffold

  • Scaffolds: where genes are located.

Gene size distribution

Min. 1st Qu. Median Mean 3rd Qu. Max.
Anna - Original 104 724 1,136 1,431 1,800 13,150
Anna - Reduced 110 842 1,285 1,556 1,953 13,150
Swift 110 843 1,281 1,552 1,943 13,000

Targeted regions size distribution

Min. 1st Qu. Median Mean 3rd Qu. Max.
Anna - Original 1 90 125 163.7 170 5,246
Anna - Reduced 1 90 125 163.2 169 5,246
Swift 1 89 123 154.4 166 5,240
  • These target size distribution are pretty much alike, but, there are some targeted regions up to ~5Kb. A question that can appear is whether these regions were expected (in terms of size). The idea with this experiment was to capture whole genes, and for this reason, theses sizes are not unexpected. Depth of coverage, which is what is being analyze here, depends more on probe feature decisions like size and density of the tiling.

On-target coverage

NOTE

Target datasets used from now on will contain the same number of genes (1,469).

I obtained the coverage of the targeted regions using bedtools[9] (v. 2.22.0), module coverage. With this, and the option -hist I can report a histogram of coverage for each feature in A as well as a summary histogram for _all_ features in A. Output (tab delimited) after each feature in A (from bedtools documentation))

  1. depth
  2. # bases at depth
  3. size of A
  4. % of A at depth
for bamfile in $(find $HOME/anna/originals -name "*.bam"); do
  tag=$(echo $(basename $bamfile) | tr "_" " " | awk '{print $1}')
  nohup bedtools coverage -hist -abam $bamfile -b "Calypte_anna.gene.CDS.2750.3.gff" | gzip > $HOME/anna/bedtools/cov/${tag}.cov.gz &
done
for bamfile in $(find $HOME/swift/originals -name "*.bam"); do
  tag=$(echo $(basename $bamfile) | tr "_" " " | awk '{print $1}')
  nohup bedtools coverage -hist -abam $bamfile -b  "Chaetura_pelagica.CDS.2.gff" | gzip > $HOME/swift/bedtools/cov/${tag}.cov.gz &
done

And so, I filtered the output to keep the coverage per region and the summary histogram separated.

# (i.e)
for tag in $(cat $HOME/anna/files/samples.txt  ); do
  nohup zcat  $HOME/anna/bedtools/cov/H09.cov.gz | grep -v ^all | gzip >  $HOME/anna/bedtools/nohist/$tag.nohist.gz &
  nohup zcat  $HOME/anna/bedtools/cov/H09.cov.gz | grep  ^all | gzip  >    $HOME/anna/bedtools/hist/$tag.hist.gz &
done
for tag in $(cat $HOME/swift/files/samples.txt | tail -n+2 | head -46 ); do
  nohup zcat  $HOME/swift/bedtools/cov/H09.cov.gz | grep -v ^all | gzip >  $HOME/swift/bedtools/nohist/$tag.nohist.gz &
  nohup zcat  $HOME/swift/bedtools/cov/H09.cov.gz | grep  ^all | gzip  >    $HOME/swift/bedtools/hist/$tag.hist.gz &
done

Breadth vs. depth

With the information extracted from the bedtools coverage -hist, we can see the relation between the breadth of coverage obtained and the depth per sample.

Map2 - target Coverage per target overview
map2Anna
map2Swift

Zoom: 0x-100x.

map2Anna map2Swift

Depth of coverage

From the bedtools coverage output I was able to extract the depth of coverage per target, gene and sample. I generated 2 matrices:

  • Target depth: dimensions – number of targets \(\times\) number of samples
  • Gene depth: dimensions – number of genes \(\times\) number of samples

Each cell of the matrix correspond to the sum of the number of times each base was covered within the target or the gene. Then, coverage was calculated as the mean value of all the depth of coverage of all the samples for a specific target/gene divided by the size of the corresponding target/gene.

For the depth of coverage for the samples, I summed the coverage of all targets and divided it by the total amount of bases that were targeted.

Note: Outliers presented below were calculated using the following:

qnt <- quantile(data, probs=c(.25, .75))
H <- 1.5 * IQR(data)
y <- data
y[data < (qnt[1] - H)] <- "Outlier"
y[data > (qnt[2] + H)] <- "Outlier"

Coverage per sample

Depth per sample

Min. 1st Qu. Median Mean 3rd Qu. Max.
map2Anna 2.115 47.76 77.73 101.3 115.8 310.2
map2Swift 1.715 38.27 62.29 83.8 98.07 265.6

Coverage distribution per sample

Coverage per gene

Min. 1st Qu. Median Mean 3rd Qu. Max.
map2Anna 12.05 72.38 98.38 105 133.5 290.3
map2Swift 0.8245 54.24 81.06 85.55 111.5 702.6

Outlier check - map2Anna

  • Identification and removal of the outliers

  • Number of outliers genes and coverage distribution

Num. outliers Min. 1st. Qu. Median Mean 3rd. Qu. Max.
map2Anna 21 225.2 237.3 251.2 255.5 272.8 290.3
map2Swift 19 199.1 207.3 215.7 246.4 234.3 702.6
  • Outlier genes and corresponding size
map2Anna Gene map2Anna Size map2Swift Gene map2Size Gene
Parent=Aan_R000557; 1025 Parent=Cpe_R000257; 1008
Parent=Aan_R000645; 1400 Parent=Cpe_R001479; 794
Parent=Aan_R000757; 1675 Parent=Cpe_R001695; 145
Parent=Aan_R000784; 1089 Parent=Cpe_R002655; 431
Parent=Aan_R000837; 2118 Parent=Cpe_R003268; 137
Parent=Aan_R001079; 884 Parent=Cpe_R003836; 461
Parent=Aan_R001728; 598 Parent=Cpe_R004636; 640
Parent=Aan_R001861; 604 Parent=Cpe_R005332; 1643
Parent=Aan_R001895; 3276 Parent=Cpe_R008000; 282
Parent=Aan_R002476; 896 Parent=Cpe_R008889; 709
Parent=Aan_R003858; 992 Parent=Cpe_R011470; 349
Parent=Aan_R004237; 3287 Parent=Cpe_R012383; 1214
Parent=Aan_R004624; 801 Parent=Cpe_R012980; 925
Parent=Aan_R005273; 508 Parent=Cpe_R013031; 753
Parent=Aan_R005275; 340 Parent=Cpe_R013115; 774
Parent=Aan_R006006; 360 Parent=Cpe_R013182; 2619
Parent=Aan_R006018; 584 Parent=Cpe_R013834; 743
Parent=Aan_R006633; 1151 Parent=Cpe_R014755; 608
Parent=Aan_R007365; 2134 Parent=Cpe_R014832; 1370
Parent=Aan_R007414; 1060
Parent=Aan_R008340; 964
  • There is no correspondence of gene outliers across datasets:
    • knownOrthologs[geneOutliersInAnna] \(\cap\) geneOutliersInSwift \(=\) \(\emptyset\)

Depth of coverage distribution per gene after removing outliers

Min. 1st Qu. Median Mean 3rd Qu. Max.
map2Anna 12.05 72.16 98.54 105 133.5 290.3
map2Swift 0.8245 54.06 80.49 83.44 109.8 195.4

Gene size vs. depth of coverage

I’m looking for a relation between gene size and depth of coverage. So I fit a linear model to the data.

## 
## Call:
## lm(formula = coveragePerGeneAnnaFiltered ~ genesAnnaFiltered$Size)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -96.192 -32.445  -5.977  28.228 178.940 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            111.955749   2.051036  54.585  < 2e-16 ***
## genesAnnaFiltered$Size  -0.004442   0.001065  -4.173 3.19e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 46.05 on 1454 degrees of freedom
## Multiple R-squared:  0.01183,    Adjusted R-squared:  0.01115 
## F-statistic: 17.41 on 1 and 1454 DF,  p-value: 3.19e-05

## 
## Call:
## lm(formula = coveragePerGeneSwiftFiltered ~ genesSwiftFiltered$Size)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -83.810 -29.445  -2.706  26.272 111.522 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             84.8002830  1.7674363  47.979   <2e-16 ***
## genesSwiftFiltered$Size -0.0008706  0.0009156  -0.951    0.342    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39.58 on 1448 degrees of freedom
## Multiple R-squared:  0.000624,   Adjusted R-squared:  -6.614e-05 
## F-statistic: 0.9042 on 1 and 1448 DF,  p-value: 0.3418

CHECK THIS OUT!

I need to understand how to interpret this

Coverage per target

Min. 1st Qu. Median Mean 3rd Qu. Max.
map2Anna 0 34.12 84.28 167.6 178.3 42,890
map2Swift 0 23.04 64.59 136.6 144.9 56,610

Check

  • Identification and removal of the outliers

  • Number of outliers targets and coverage distribution

Num. outliers Min. 1st. Qu. Median Mean 3rd. Qu. Max.
map2Anna 1,066 0 31.09 75.25 102 150 394.5
map2Swift 1,096 0 20.64 57.1 80.72 119.5 327.2

Depth of coverage distribution per target after removing coverage outliers

Min. 1st Qu. Median Mean 3rd Qu. Max.
map2Anna 0 31.09 75.25 102 150 394.5
map2Swift 0 20.64 57.1 80.72 119.5 327.2

Size outliers removal

  • Target size distribution: Target size distribution has not change since the beginning of the analysis, current size distribution can be shown here. For a proper finding of correlation between size and coverage I had to remove both, size outliers and coverage outliers.

Coverage distribution after removing outliers

Num. outliers Min. 1st. Qu. Median Mean 3rd. Qu. Max.
map2Anna 810 0 34.75 80.78 106.3 155.3 394.5
map2Swift 814 0 23.12 60.83 84.11 124.2 327.2

Size distribution after removing outliers

Target size vs. depth of coverage

I’m looking for a relation between target size and depth of coverage.

## 
## Call:
## lm(formula = anna.6$Coverage ~ anna.6$Size)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -167.95  -65.13  -20.57   47.08  337.69 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 168.43121    2.13600   78.85   <2e-16 ***
## anna.6$Size  -0.48399    0.01548  -31.26   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 88.31 on 12742 degrees of freedom
## Multiple R-squared:  0.07122,    Adjusted R-squared:  0.07115 
## F-statistic: 977.1 on 1 and 12742 DF,  p-value: < 2.2e-16

## 
## Call:
## lm(formula = swift.7$Coverage ~ swift.7$Size)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -125.95  -56.29  -19.71   39.86  279.48 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  126.28067    1.81543   69.56   <2e-16 ***
## swift.7$Size  -0.32870    0.01316  -24.98   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 75.28 on 12734 degrees of freedom
## Multiple R-squared:  0.0467, Adjusted R-squared:  0.04663 
## F-statistic: 623.8 on 1 and 12734 DF,  p-value: < 2.2e-16

CHECK THIS OUT!

I need to understand how to interpret this

Overview

While there is no specific color key for the coverage values, this will only serve as an overview of the coverage per target/gene per sample.

  • Yellow maximum coverage
  • Red minimum coverage.

Genes not captured

  • This is a summary of the genes that were not totally captured by a single base, per sample. Meaning, coverage is below 1x.

  • Are these genes the same per sample?
    • No, they are not. This is the list of genes not captured in the Swift dataset, taking into account that the Anna dataset has all genes with coverage >0x.
    • H46 and H47 were samples with really low coverage, it can be expected to loose some genes when doing the capture.
    • Also for both datasets (full and without outliers) missing genes are the same
Full (1.469 Genes) Without Outliers
Parent=Cpe_R000823; Parent=Cpe_R000823;
Parent=Cpe_R008410; Parent=Cpe_R008410;
Parent=Cpe_R008909; Parent=Cpe_R008909;

Targeted regions not captured

This is a summary of the target regions that were not captured by a single base, per sample.

Dataset Number of common target regions per sample Percentage (common target/total target regions)
Within map2Anna samples Full dataset 268 1.81818181818182
Within map2Anna samples Outliers removed 268 2.10295040803515
Within map2Swift samples Full dataset 153 1.0363045245191
Within map2Swift samples Outliers removed 153 1.20131909547739
Across datasets Full Dataset 0 0
Across datasets Outliers removed 0 0
  • This shows that there are not common target regions across datasets.
  • Makes sense that they do not match, because the target file of the map2Swift was match (by genes) to the map2Anna target file, and coordinates are different
  • Makes sense that the number of targets not covered is lower in the map2Swift dataset, due to the lower number of targets in general, even though the percentages are similar (1%)
  • Also, the differences between the with/without outliers datasets are consistent to the removal of outliers
  • Common target regions are still the same.
  • I cannot find a possible reason why these regions where not captured, but the fact that such targets simply did not work.
  • Then, going through the sizes of those targets, it does not show much, mean size is ~130 bp, which is close to the size of the probes.

Min. 1st. Qu. Median Mean 3rd. Qu Max.
Anna 1 91 122 128.4 160 285
Swift 1 91 122 128.3 160 285

Off-target regions

Generation of on and off target datasets

First, is important to define what is an off-target region. For the purpose of this analysis, there are two (2) definition of an off-target region.

  1. The first definition considers off-target regions as the difference between the genome that can be captured and the regions in the targets file.
  2. The second definition considers off-target regions as a set difference between the genome that can be captured and the codon regions in Calypte_anna.gene.CDS.2750.2.gff and the regions in Chaetura_pelagica.CDS.2.gff, for their corresponding dataset.
    • This is the definition that is going to be used. This is due to the fact that codon regions in each file correspond to those genes that are known to be orthologs. There is a possibility that there might be potential orthologs within the rest of the genes, and so is possible that some of the reads may have mapped to those genes due to the similarity of the regions.

I’m assuming that the all the possible DNA captured (on/off-target) is defined by a set of scaffolds in each GFF file. The information that we know is the position of the codons per dataset (represented in the darker box, bottom), the Calypte_anna.gene.CDS.2750.gff file, for the map2Anna dataset and the Chaetura_pelagica.CDS.2.gff for the map2Swift dataset. These positions are relative to the scaffold that represents a region of the genome that was able to be assembled. We also have the relation of the codons to their corresponding genes.

Off-target generation

Off-target generation

So, in order to know what was the size of the off-target regions, I first needed to obtain the size of the possible DNA to be captured, meaning on and off target. To do so, I (per file):

  1. Grouped targets per scaffolds
  2. From those targets, extracted the highest end position, which would be the highest position possible per scaffold and therefore, the definition of the scaffold size.
  3. Generated a file, where I would have all the scaffold regions defined, considering that all scaffolds regions start at position 1, and end in their highest position.
WD="/media/merly/Baymax/research/cph-visit/coverage-analysis/"
allgenes=paste0(WD,"files/Calypte_anna.gene.CDS.gff")
anna=read.table(allgenes, colClasses = c("character","character","character",
                                         "numeric","numeric","character",
                                         "character","character","character"))
swiftfull=paste0(WD,"files/Chaetura_pelagica.CDS.gff")
swift=read.table(swiftfull, colClasses = c("character","character","character",
                                         "numeric","numeric","character",
                                         "character","character","character"))
sAnnaAll=split(anna,anna$V1)
sSwiftAll=split(swift,swift$V1)
maxPositionPerScaffoldAnna=sapply(sAnnaAll,function(x){max(x$V5)})
maxPositionPerScaffoldSwift=sapply(sSwiftAll,function(x){max(x$V5)})

maxLen=max(
  length(maxPositionPerScaffoldAnna),
  length(maxPositionPerScaffoldSwift))
unionScaffolds=union(
  names(maxPositionPerScaffoldAnna),
  names(maxPositionPerScaffoldSwift))
df=data.frame(
  anna=rep(0,length(unionScaffolds)),
  swift=rep(0,length(unionScaffolds)))
rownames(df)=unionScaffolds

df[names(maxPositionPerScaffoldAnna),]$anna=maxPositionPerScaffoldAnna[names(maxPositionPerScaffoldAnna)]
df[names(maxPositionPerScaffoldSwift),]$swift=maxPositionPerScaffoldSwift[names(maxPositionPerScaffoldSwift)]

dfAnna=df$anna[df$anna!=0]
names(dfAnna)=rownames(df)[df$anna!=0]
finalBed=cbind(names(dfAnna),rep(1,length(dfAnna)),dfAnna)
write.table(finalBed,paste0(WD,"files/scaffolds.anna.bed"), row.names = F,col.names = F,quote = F, sep="\t")

dfSwift=df$swift[df$swift!=0]
names(dfSwift)=rownames(df)[df$swift!=0]
finalBed=cbind(names(dfSwift),rep(1,length(dfSwift)),dfSwift)
write.table(finalBed,paste0(WD,"files/scaffolds.swift.bed"), row.names = F,col.names = F,quote = F, sep="\t")
  1. Used bedtools subtract (searches for features in B that overlap A. If an overlapping feature is found in B, the overlapping portion is removed from A and the remaining portion of A is reported) to removed all possible targets from the GFF files out of their corresponding scaffolds file, obtaining two (2) BED files, one per dataset.
bedtools subtract -a scaffolds.anna.bed -b $gffAnna > "$HOME/files/offtarget.anna.bed" &
bedtools subtract -a scaffolds.anna.bed -b $gffSwift > "$HOME/files/offtarget.swift.bed" &

General overview of scaffolds and resulting off-target regions

Number of scaffolds Total Genome Size (scaffolds) Number of resulting targets Total Genome Size (Off-target)
Anna 1,229 1,000,821,032 15,929 998,519,317
Swift 1,189 1,040,991,443 15,906 1,038,695,864

General overview of the size distribution of off-target regions

Unit Min. 1st. Qu. Median Mean 3rd. Qu Max.
Anna Scaffold 106 1,664 17,380 814,300 627,000 23,300,000
Swift Scaffold 103 8,954 93,180 875,500 767,800 19,070,000
Anna Target 1 523 1,089 62,690 3,092 5,756,000
Swift Target 1 553 1,151 65,300 3,544 5,739,000

Off-target regions across samples

The idea is to identify the off-target regions, as well as the size that covers and how is the coverage distribution within this regions compared to the on-target coverage.

To obtain coverage information from the off-target regions:

for bamfile in $(find $originalsAnna -name "*.bam" ); do
  tag=$(echo $(basename $bamfile) | tr "_" " " | awk '{print $1}')
  nohup bedtools coverage -hist -abam $bamfile -b "$HOME/files/offtarget.anna.bed" | gzip > $HOME/anna/bedtools/cov2/${tag}.cov.gz &
done
for bamfile in $(find $originalsSwift -name "*.bam"  ); do
  tag=$(echo $(basename $bamfile) | tr "_" " " | awk '{print $1}')
  nohup bedtools coverage -hist -abam $bamfile -b "$HOME/files/offtarget.swift.bed" | gzip > $HOME/swift/bedtools/cov2/${tag}.cov.gz &
done
#  hist split
for tag in $(cat $HOME/anna/files/samples.txt  ); do
  zcat $HOME/anna/bedtools/cov2/$tag.cov.gz | grep -v ^all | gzip > $HOME/anna/bedtools/nohist2/$tag.nohist.gz
  zcat $HOME/anna/bedtools/cov2/$tag.cov.gz | grep ^all | gzip > $HOME/anna/bedtools/hist2/$tag.hist.gz
done
for tag in $(cat $HOME/swift/files/samples.txt  ); do
  zcat $HOME/swift/bedtools/cov2/$tag.cov.gz | grep -v ^all | gzip > $HOME/swift/bedtools/nohist2/$tag.nohist.gz
  zcat $HOME/swift/bedtools/cov2/$tag.cov.gz | grep ^all | gzip > $HOME/swift/bedtools/hist2/$tag.hist.gz
done

Breadth vs. depth of coverage

I followed what has been done for the on-target regions to get this information. Getting how much of the off-target regions was captured, and at which depth.

map2Anna map2Swift
X-axis: 0-20 X-axis: 0-20
  • For the map2Anna dataset, for both off-target sets breadth vs. depth looks the same. Same situation with the mat2Swift dataset.
  • For the map2Anna, we see that the general depth is low, 25% of the off-target regions are covered at 1x for most of the samples.
  • For the map2Swift, we see that there are 3 situations:
    • Swift sample is the one that has more coverage in general (gray thick line)
    • Anna and H10 have a “medium” level coverage
    • Rest of the sample behaves in the same way as in the other dataset.
    • For Anna and Swift samples makes sense that the general off-target coverage is higher because these to sample datasets come from a whole-genome sequencing (WGS) experiment.

Depth of coverage

Coverage per sample

  • The 0x coverage for AnnaBGI and SwiftBGI in the map2Anna datasets, is because these samples were not in that mapping.
  • It is possible to see that the general coverage of the off-target regions is < 1x for all the samples. Less than 1% of the on-target coverages seen.

Coverage per target

Min. 1st. Qu. Median Mean 3rd. Qu Max. Var. Std. Dev
map2Anna 0 1.518 7.4 48.29 21.89 34,600 243,841.7 493.8033
map2Swift 0 0.6185 2.794 34.51 9.069 37,860 337,580.14666756 581.016477105048
  • Most of the targets are below 21x
  • The coverage among targets has a really high variability.

Relation within coverage and phylogenetic distance

I am trying to analyze if there is any correlation between the depth of coverage and the phylogenetic distance from the reference species, used for the probe generation and mapping.

Phylogenetic reconstruction

Information of the phylogenetic reconstruction I got from the Hummingbirds Paper, this says that it was made with mitochondrial genome and nuclear gene trees using either a concatenation approach with RAxML [3] or the multi-species coalescent approach implemented in ASTRAL [4] and ASTRAL-II [5].

The phylogenies were built using subsets of the nuclear genes present the same topology between the main groups of hummingbirds with high confidence. The three subsets correspond to:

  1. the 2949 genes that were successfully captured (in a minimum of 8 species)
  2. the 1987 genes for which we could assign a high confidence swift orthologs determined in [5] (used as a root)
  3. the 741 genes within the latter that produced trees with average support above 50%.

This is the Supplementary Table S5 (from the Hummingbirds Paper), describing the subsets selected:

number of genes minimum number of sites per species (concatenated) maximum number of sites per species (concatenated) includes outgroup ASTRAL score
741 949,470 1,542,750 yes 87%
1987 1,667,071 2,783,832 yes 78%
2949 2,473,664 3,792,500
Maximum likelihood tree

Maximum likelihood tree

  • Cpe, outgroup reference.
  • Aan, ingroup reference.

Depth of coverage vs. phylogenetic distance

For the analysis shown above as well as for phylogeny reconstruction using the dataset (1) with 2,949 genes.

Overview

To obtain the distance matrix I used the tree obtained with RAxML, from the concatenation of 2,750 genes. The tree gives me the branch lengths in number of substitutions per site. So, the distance calculated here, is the pairwise distance of the tips. This was done with the R package phangorn [7], and its cophenetic function.

sampleColors = colorRampPalette(brewer.pal(9, "Set1"))(48)
treefile="/home/merly/git/cph-visit/coverage-analysis/capture-phylotenetic-decay/trees/RAxML_bipartitions.2011.concat"
tree=read.nexus(treefile)
distMatrix=cophenetic(tree)
dCpe=distMatrix["Cpe",]
dAan=distMatrix["Aan",]
distMatrix[upper.tri(distMatrix)]=NA

Following the figure F3A in Braggs’s paper [8], where it shows sequencing coverage as a function of genetic divergence. I have calculated the depth of coverage for both datasets and phylogenetic distance. In the plot below you can see the median target coverage per sample.

## Warning in matrixAnna/targets$Size: longer object length is not a multiple
## of shorter object length

Detailed All

After genes with missing data were removed leaving 2,750 genes, and after matching those genes to the ones available in Swift, 1,469.

For the phylogenetic distance vs. coverage analysis, I will used a smaller dataset (2), with 1,987 genes, which I will intersect with those from which I do have the coverage information.

To do that, I generated a new dataset from the intersection of the genes found in Calypte_anna.gene.CDS.2750.3.gff (1,469), which are those ortholog genes matching across datasets, and the available gene trees (1,988).


cat "$WD/files/Calypte_anna.gene.CDS.2750.3.gff" | cut -f9 | uniq | tr "_" " " | awk '{print $2}' | tr ";" " " > g.trees.to.filter.txt

for gtreefile in $(cat $WD/files/g.trees.to.filter.txt); do
  cp "$WD/trees/gtrees/$gtreefile" "$WD/trees/gtrees.2/"
done
  • Gene trees not found: R001806, R005562, R009279, R014527, R014590.
  • Resulting on 1,464 genes.
  • Trees didn’t have branch lengths, I moved to the available multiple-sequence alignments (MSA), from which only 741 could be retrieved.
ls "$WD/trees/seqs" | tr "." " " | awk '{print $1}'  > "$WD/files/g.trees.to.filter.seqs.2.txt"
  • From those, I generated a smaller subset of gene trees MSAs.
mkdir "$WD/trees/seqs.2"
for gtreefile in $(cat $WD/files/g.trees.to.filter.seqs.2.txt); do
  filename="$WD/trees/seqs/${gtreefile}.x.aa.aln.backTranslated.gz"
  cp $filename "$WD/trees/seqs.2/"
done
  • I calculated the pairwise Hamming distance of the resulting gene trees MSAs, and extracted the vectors belonging to the distances to Anna and to Swift.
geneList=unlist(read.table(paste0(WD,"files/g.trees.to.filter.seqs.2.txt"), stringsAsFactors=F))
distancesAnna=matrix(rep(0,length(geneList)*length(sampleNamesSwift)), length(geneList), length(sampleNamesSwift))
distancesSwift=matrix(rep(0,length(geneList)*length(sampleNamesSwift)), length(geneList), length(sampleNamesSwift))
rownames(distancesAnna)=geneList; colnames(distancesAnna)=sampleNamesSwift;
rownames(distancesSwift)=geneList; colnames(distancesSwift)=sampleNamesSwift

gtreefiles <- list.files(
  path = paste0(WD,"trees/seqs.2/"),
  pattern="*.x.aa.aln.backTranslated")
gtreefiles=paste0(WD,"trees/seqs.2/",gtreefiles)
labelsCorrect=c("H05","H01","H08","H03","H04","H06","H02","H07")
labelsWronglyFormated=c("H5","H1","H8","H3","H4","H6","H2","H7")

for (i in 1:length(gtreefiles)) {
  print(i)
  msa=as.phyDat(read.FASTA(gzfile(gtreefiles[i])))
  hammDist=as.matrix(dist.hamming(msa))
  indexAnna=which(startsWith(rownames(hammDist),"Aan"))
  indexSwift=which(startsWith(rownames(hammDist),"Cpe"))
  namesIds=which(rownames(hammDist) %in% labelsCorrect)
  rownames(hammDist)[c(indexAnna, indexSwift, namesIds)]=c("AnnaBGI","SwiftBGI",labelsWronglyFormated)
  colnames(hammDist)[c(indexAnna, indexSwift, namesIds)]=c("AnnaBGI","SwiftBGI",labelsWronglyFormated)
  gene=strsplit(basename(gtreefiles[i]),"\\.")[[1]][1]
  distancesAnna[gene,colnames(distancesAnna)]=hammDist[indexAnna,colnames(distancesAnna)]
  distancesSwift[gene,colnames(distancesSwift)]=hammDist[indexSwift,colnames(distancesSwift)]
}
write.table(distancesAnna,file=paste0(WD,"files/distance.matrix.per.gene.anna.txt"), col.names=T,row.names=T)
write.table(distancesSwift,file=paste0(WD,"files/distance.matrix.per.gene.swift.txt"), col.names=T,row.names=T)
  • Afterwards, I filtered them to match those genes from which coverage was calculated. Leaving a total of 623 genes to work with.
distancesAnna=read.table(file=paste0(WD,"files/distance.matrix.per.gene.anna.txt"), header=T)
distancesSwift=read.table(file=paste0(WD,"files/distance.matrix.per.gene.swift.txt"), header=T)
birds48filename=paste0(WD,"files/48birds_ortholog.list.chi.anna.cpe.hum.finch")
birds=read.table(birds48filename, header=T,
    colClasses=c("character","numeric","character","character",
      "character","character","character"))

birds.2=birds[birds$anna!="-",]
birds.3=birds.2[birds.2$swift!="-",]
birds.4=unique(birds.3)
birds.5=birds.4[,c("anna","swift")]
birds.5$swift=paste0("Parent=",birds.5$swift,";")
birds.5$anna=paste0("Parent=Aan_", birds.5$anna,";")
annaGenes=unique(gffAnna$V9)
birds.6=birds.5[birds.5$anna %in% annaGenes, ]
birds.6$swift.plain=unlist(strsplit(unlist(lapply(strsplit(birds.6$swift,"_"), function(x){x[[2]]})), ";"))
birds.6$anna.plain=unlist(strsplit(unlist(lapply(strsplit(birds.6$anna,"_"), function(x){x[[2]]})), ";"))
smallbird=birds.6[birds.6$anna.plain %in% intersect(birds.6$anna.plain,geneList),]
distancesAnna=distancesAnna[smallbird$anna.plain,]
distancesSwift=distancesSwift[smallbird$anna.plain,]
rownames(distancesAnna)=smallbird$anna
rownames(distancesSwift)=smallbird$swift
  • Retrieved the coverage information per gene and filtered the coverage matrices to match the 623 resulting genes.
samplesFilenameAnna=paste0(WD,"anna/files/samples.txt")
samplesFilenameSwift=paste0(WD,"swift/files/samples.txt")
sampleNamesAnna=unlist(read.table(samplesFilenameAnna, stringsAsFactors = F))
sampleNamesSwift=unlist(read.table(samplesFilenameSwift, stringsAsFactors = F))
coverageMatrixFileAnna=paste0(WD,"anna/files/coverage.matrix.per.gene.txt")
coverageMatrixFileSwift=paste0(WD,"swift/files/coverage.matrix.per.gene.txt")
coverageMatrixAnna=read.table(coverageMatrixFileAnna, header=T)/genesAnna$Size
coverageMatrixSwift=read.table(coverageMatrixFileSwift, header=T)/genesSwift$Size
rownames(coverageMatrixAnna)=genesAnna$Gene
rownames(coverageMatrixSwift)=genesSwift$Gene
#-------------------------------------------------------------------------------
cmswift=coverageMatrixSwift[smallbird$swift,]
cmanna=coverageMatrixAnna[smallbird$anna,]
cmanna=cbind(cmanna,AnnaBGI=rep(0,nrow(cmanna)), SwiftBGI=rep(0,nrow(cmanna)))
  • Linear correlation was calculated after removing outliers in both coverage and distance
  • What is shown in the plot below is the relation between depth of coverage and phylogenetic distance per gene per sample. 623 genes were used. 48 Samples.
## 
## Call:
## lm(formula = finalCoverageSwift ~ finalDistanceSwift)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -82.60 -33.14 -11.46  22.18 161.79 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          85.2915     0.9318   91.54   <2e-16 ***
## finalDistanceSwift -346.9787    12.2923  -28.23   <2e-16 ***
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
## 
## Residual standard error: 46.16 on 25678 degrees of freedom
## Multiple R-squared:  0.0301, Adjusted R-squared:  0.03006 
## F-statistic: 796.8 on 1 and 25678 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = finalCoverageAnna ~ finalDistanceAnna)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -85.17 -39.53 -13.07  26.76 181.06 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        67.0040     0.7353   91.12   <2e-16 ***
## finalDistanceAnna 431.0792    36.8383   11.70   <2e-16 ***
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
## 
## Residual standard error: 54.25 on 24992 degrees of freedom
## Multiple R-squared:  0.005449,   Adjusted R-squared:  0.00541 
## F-statistic: 136.9 on 1 and 24992 DF,  p-value: < 2.2e-16
  • P-values show that there might be a linear relation between these 2 variables

Detailed - pairs per species

Now, the plots below show the relation between phylogenetic distance and coverage, of two (2) samples from the Cocquettes, Brilliants, Mangoes and Hermits groups, following the ML tree from the Hummingbirds paper (F1,A). These samples correspond to one with high coverage and one with low coverage.


References

  • [1]: Bleiweiss R, Kirsch JAW, Matheus JC (1994) DNA-DNA Hybridization Evidence for Sub-family Structure among Hummingbirds. Auk 111(1):8–19. DOI: 10.2307/4088500
  • [2]: McGuire JA, et al. (2014) Molecular phylogenetics and the diversification of hummingbirds. Curr Biol 24(8):910–6. DOI: 10.1016/j.cub.2014.03.016
  • [3]: Stamatakis A (2014) RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30(9):1312–1313. DOI: 10.1093/bioinformatics/btu033
  • [5]: Mirarab S, Warnow T (2015) ASTRAL-II: coalescent-based species tree estimation with many hundreds of taxa and thousands of genes. Bioinformatics 31(12):i44–52. DOI: 10.1093/bioinformatics/btv234
  • [6]: Jarvis ED, et al. (2014) Whole Genome Analyses Resolve the Early Branches in the Tree of Life of Modern Birds. Science (80- ) 346(6215):1320–1331. DOI: 10.1126/science.1253451
  • [8]: Bragg, JG., Potter, S., Bi. K. and Moritz, C. (2016) Exon capture phylogenomics: efficacy across scales of divergence Molecular Ecology Resources 16, 1059–1068. DOI: 10.1111/1755-0998.12449
  • [9] Aaron R. Quinlan Ira M. Hall (2010) BEDTools: a flexible suite of utilities for comparing genomic features Bioinformatics 26 (6): 841-842. DOI: 10.1093/bioinformatics/btq033
  • [10] Korneliussen, Thorfinn S. and Albrechtsen, Anders and Nielsen, Rasmus (2014) ANGSD: Analysis of Next Generation Sequencing Data BMC Bioinformatics 15:356. DOI: 10.1186/s12859-014-0356-4